## [1] "Explore_And_Summarize.Rmd" "projecttemplate.rmd"
## [3] "wineQualityReds.csv"
This report explores the quality of red wines.
As the plot above shows, most of the wines have a quality rating of 5 or 6.
The plot above shows that an alcohol content of about 9.6 has the highest count of red wines.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Using as id variables
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
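The `stat_bin()` messages above appear when a histogram falls back to the default of 30 bins. A minimal sketch of setting `binwidth` explicitly is shown here. The data frame is a randomly generated stand-in, not the real values from `wineQualityReds.csv`, and the bin width of 0.1 is an assumption that would need tuning for each variable.

```r
library(ggplot2)

# Hypothetical stand-in for the wine data: random values, not the real dataset.
set.seed(1)
wine <- data.frame(alcohol = round(rnorm(200, mean = 10.4, sd = 1), 1))

# An explicit binwidth replaces the default bins = 30, so no message is emitted.
p <- ggplot(wine, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1, colour = "white") +
  labs(x = "Alcohol", y = "Number of wines")

print(p)
```

A narrower bin width makes individual peaks easier to read, at the cost of a noisier outline.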
The plots above show that residual.sugar and chlorides are right-skewed, while the remaining variables are roughly normally distributed.
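Skew can also be checked numerically rather than by eye. The sketch below defines a small `skewness()` helper (a hypothetical name, not a base R function) and applies it to randomly generated right-skewed values that merely stand in for a variable such as residual.sugar. It also shows how a `log10` transform can pull a long right tail back toward symmetry.

```r
# Sample skewness: third central moment divided by the cubed standard deviation.
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Hypothetical right-skewed stand-in data, not the real measurements.
set.seed(2)
x <- rlnorm(1000, meanlog = 0.8, sdlog = 0.5)

skew_raw <- skewness(x)         # clearly positive for a right-skewed variable
skew_log <- skewness(log10(x))  # much closer to zero after the transform

print(c(raw = skew_raw, log10 = skew_log))
```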
The main feature of interest is the quality of the wine and how it depends on the other variables, such as residual sugar, alcohol content, and density.
I did not apply any transformations to the dataset and did not create any new variables from the existing ones.
## Warning: Removed 41 rows containing missing values (geom_point).
## Warning: Removed 69 rows containing missing values (geom_point).
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: density and quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: pH and quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: citric.acid and quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
##
## Pearson's product-moment correlation
##
## data: sulphates and quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The correlations above suggest that alcohol (r = 0.48), sulphates (r = 0.25), and citric acid (r = 0.23) play the largest part in determining the quality of red wine. The plots below show these relationships.
We see that wines with higher sulphate content tend to be of better quality.
Here we see that better-quality wines tend to have higher citric acid content.
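The eight correlation estimates reported above can be gathered into one ranked table to make the comparison easier. The values below are copied from the `cor.test()` output in this report; the vector name `cor_with_quality` is only an illustrative choice.

```r
# Pearson correlations with quality, copied from the cor.test() output above.
cor_with_quality <- c(
  alcohol              =  0.4761663,
  density              = -0.1749192,
  pH                   = -0.05773139,
  total.sulfur.dioxide = -0.1851003,
  citric.acid          =  0.2263725,
  fixed.acidity        =  0.1240516,
  residual.sugar       =  0.01373164,
  sulphates            =  0.2513971
)

# Rank by absolute strength; r^2 is the share of variance in quality that a
# straight-line fit on that single variable would account for.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(data.frame(r = ranked, r_squared = ranked^2), 3))
```

Even for alcohol, r squared is only about 0.23, so no single variable accounts for most of the variation in quality.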
To test the relationships between other pairs of variables, we compute the correlation between two variables at a time and check whether the relationship is strong.
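Each `cor.test()` result in this report prints a t statistic alongside the estimate. For a Pearson correlation the two are linked by t = r * sqrt(df) / sqrt(1 - r^2), so the printed values can be cross-checked. The helper name `t_from_r` is hypothetical; the numbers are copied from the report's output.

```r
# t statistic for a Pearson correlation r with df = n - 2 degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

df <- 1597  # as printed in every cor.test() output in this report

# Estimates copied from the report's output.
t_alcohol <- t_from_r(0.4761663, df)   # report prints t = 21.639
t_density <- t_from_r(-0.1749192, df)  # report prints t = -7.0997

# Two-sided p-value from the t distribution.
p_density <- 2 * pt(-abs(t_density), df)  # report prints 1.875e-12

print(c(t_alcohol = t_alcohol, t_density = t_density, p_density = p_density))
```

With 1599 wines, even a weak correlation produces a very small p-value, so the size of r is a better guide to strength than the p-value alone.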
##
## Pearson's product-moment correlation
##
## data: pH and fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As expected, fixed acidity decreases as pH increases (r = -0.68).
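The 95 percent confidence interval printed for pH and fixed.acidity can be reproduced by hand. R's `cor.test()` documents its Pearson interval as based on Fisher's Z transform; the sketch below follows that approach, with `fisher_ci` as a hypothetical helper name and the sample size inferred from df = 1597.

```r
# Fisher z interval for a Pearson correlation r from n observations.
fisher_ci <- function(r, n, conf_level = 0.95) {
  z      <- atanh(r)                      # Fisher z-transform
  se     <- 1 / sqrt(n - 3)               # approximate standard error of z
  z_crit <- qnorm((1 + conf_level) / 2)   # about 1.96 for 95 percent
  tanh(z + c(-1, 1) * z_crit * se)        # back-transform to the r scale
}

# pH vs fixed.acidity: r = -0.6829782, and df = 1597 implies n = 1599.
ci <- fisher_ci(-0.6829782, n = 1599)
print(ci)  # report prints -0.7082857 -0.6559174
```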
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and residual.sugar
## t = 8.2861, df = 1597, p-value = 2.449e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1555549 0.2495652
## sample estimates:
## cor
## 0.2030279
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 125 rows containing non-finite values (stat_smooth).
## Warning: Removed 125 rows containing missing values (geom_point).
As expected, total sulfur dioxide and residual sugar are only weakly correlated (r = 0.20). The graph does not show any large spikes in sulfur dioxide as sugar content increases, so no firm conclusion can be drawn about this relationship.
The strongest relationships found are between pH and fixed acidity (r = -0.68) and between alcohol and quality (r = 0.48).
The best-quality wines have an alcohol content above 10 as well as some citric acid. Lower-quality wines have relatively more citric acid than alcohol content.
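One way to check this pattern numerically is to average alcohol and citric.acid within each quality level. The sketch below uses six made-up rows purely to show the mechanics; the values are not taken from `wineQualityReds.csv`, and only the column names follow the report.

```r
# Made-up illustrative rows, NOT the real wine data.
wine <- data.frame(
  quality     = c(5, 5, 6, 6, 7, 7),
  alcohol     = c(9.4, 9.8, 10.2, 10.6, 11.5, 12.1),
  citric.acid = c(0.10, 0.20, 0.25, 0.31, 0.40, 0.44)
)

# Mean alcohol and citric acid for each quality rating.
by_quality <- aggregate(cbind(alcohol, citric.acid) ~ quality,
                        data = wine, FUN = mean)
print(by_quality)
```

Applied to the real data frame, the same call would give one row per quality rating, which makes it easy to see whether both variables rise together.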
# Multivariate Analysis
The strongest predictor of red wine quality is alcohol content; the other factors are secondary and matter much less for quality.
The surprising finding is that alcohol is the only variable that clearly influences wine quality. Other factors such as pH and fixed acidity do have an effect, but some wines are still of poor quality even when their alcohol, pH, and acidity are as they should be.
I have not created any model for the dataset.
Most of the wines have a quality rating of 5 or 6. The highest-quality wines are scarce.
The best-quality wines have a high alcohol content. The alcohol content is high even for some medium-quality wines.
The higher the alcohol content, the better the quality, but citric acid is also an important factor in good-quality wines. Quality clearly increases when citric acid and alcohol are both present in the right proportion.
The exploration showed that alcohol and citric acid together are associated with better-quality wines. The main struggle was that, of all the variables, hardly one or two have a strong correlation with quality. The intuition that acidity, sulfur, and sugar content are strongly related to wine quality was contradicted. It was surprising that acidity did not have much to do with quality. A model could be built for this dataset, which I have not done.
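As a starting point for such a model, a linear regression on the three variables most correlated with quality could look like the sketch below. This is only one possible approach (quality is an ordinal rating, so other model types may suit it better), and the data frame is randomly generated with assumed coefficients, so the fitted numbers say nothing about the real wines.

```r
# Hypothetical stand-in data with the report's column names; values are random.
set.seed(42)
n <- 300
wine <- data.frame(
  alcohol     = runif(n, 8.5, 14),
  sulphates   = runif(n, 0.3, 1.2),
  citric.acid = runif(n, 0, 0.8)
)
# Assumed relationship used only to generate example ratings.
wine$quality <- round(2 + 0.3 * wine$alcohol + 0.8 * wine$sulphates +
                        0.5 * wine$citric.acid + rnorm(n, sd = 0.6))

# Ordinary least squares fit of quality on the three predictors.
fit <- lm(quality ~ alcohol + sulphates + citric.acid, data = wine)
print(summary(fit)$coefficients)
```

On the real data, the same `lm()` call would show how much of the variation in quality these three variables explain together.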